Identifying influential spreaders in complex networks 
Networks portray a multitude of interactions through which people meet, ideas
are spread, and infectious diseases propagate within a society [1-5]. Identifying
the most efficient "spreaders" in a network is an important step to optimize the
use of available resources and ensure the more efficient spread of information.
Here we show that, in contrast to common belief, there are plausible circumstances
where the best spreaders do not correspond to the best connected people
or to the most central people (high betweenness centrality) [6-10]. Instead, we
find: (i) The most efficient spreaders are those located within the core of the
network as identified by the k-shell decomposition analysis [11-13]. (ii) When
multiple spreaders are considered simultaneously, the distance between them
becomes the crucial parameter that determines the extent of the spreading.
Furthermore, we find that infections persist in the high k-shells of the network,
even in the case where recovered individuals do not develop immunity. Our analysis
provides a plausible route for an optimal design of efficient dissemination
strategies.

Spreading is a ubiquitous process which describes many important activities in society
[1-5]. The knowledge of the spreading pathways through the network of social interactions is
crucial for developing efficient methods to either hinder spreading in the case of diseases, or
accelerate spreading in the case of information dissemination. Indeed, people are connected
according to the way they interact with each other in society and the large heterogeneity of
the resulting network greatly determines the efficiency and speed of spreading. In the case
of networks with a broad degree distribution (number of links per node), it is believed
that the most connected people (hubs) are the key players, being responsible for the largest
scale of the spreading process. Furthermore, in the context of social network theory,
the importance of a node for spreading is often associated with the betweenness centrality, a
measure of how many shortest paths cross through this node, which is believed to determine
who has more 'interpersonal influence' on others [9, 10].

Here we argue that the topology of the network organization plays an important role such
that there are plausible circumstances under which the highly connected nodes or the highest
betweenness nodes have little effect in the range of a given spreading process. For example,
if a hub exists at the end of a branch at the periphery of a network, it will have a minimal
impact in the spreading process through the core of the network, while a less connected
person who is strategically placed in the core of the network will have a significant effect
that leads to dissemination through a large fraction of the population. In order to identify the
core and the periphery of the network we use the k-shell (also called k-core) decomposition
of the network [11-14]. Examining this quantity in a number of real networks allows us
to identify the best individual spreaders in the network when the spreading originates in a
single node. For the case of a spreading process originating in many nodes simultaneously
we show that we can further improve the efficiency by considering spreading origins located
at a determined distance from each other.

We study real-world complex networks that represent archetypical examples of social
structures. We investigate (i) the friendship network between 3.4 million members of the
LiveJournal.com community [15], (ii) the network of email contacts in the Computer Science
Department of the University College London (Zhou, S., private communication), (iii) the
contact network of inpatients (CNI) collected from hospitals in Sweden [3], and (iv) the
network of actors who have co-starred in movies labeled by imdb.com as adult [13] (see
Supplementary Information Section I for details).

To study the spreading process we apply the Susceptible-Infectious-Recovered (SIR) and
Susceptible-Infectious-Susceptible (SIS) models [2, 3, 18] on the above networks (see Methods
section). These models have been used to describe disease spreading as well as information
and rumor spreading in social processes where an actor constantly needs to be reminded
[19]. We denote the probability that an infectious node will infect a susceptible neighbor as
β. In our study we use relatively small values for β, so that the infected percentage of the
population remains small. In the case of large β values, where spreading can reach a large
fraction of the population, the role of individual nodes is no longer important and spreading
would cover almost all the network, independently of where it originated from.

The location of a node in the network is obtained using the k-shell decomposition analysis
[13]. This process assigns an integer index or coreness, ks, to each node representing
its location according to successive layers (k-shells) in the network. The ks index is a
quite robust measure and the ranking of the nodes is not influenced significantly in the case of
incomplete information (for details see SI-Fig. 6 in SI-Section II). Small values of ks define
the periphery of the network and the innermost network core corresponds to large ks (see
Fig. 1a and SI-Section II). Figures 1b-d illustrate the fact that the size of the population
infected in a spreading process (shown in this example in the CNI network) is not necessarily
related to the degree of the node, k, where the spreading has started. Spreading may be
very different even when it starts from hubs of similar degree, as comparatively shown in
Figs. 1b and c. Instead, the location of the spreading origin given by its ks index predicts
more accurately the size of the infected population. For instance, Figs. 1b and d show that
nodes in the same ks layer produce similar spreading areas even if they have different k (by
definition, in a given layer there could be many nodes with k ≥ ks).

The above example suggests that the position of the node relative to the organization of
the network determines its spreading influence more than a local property of a node, like
the degree k. To quantify the influence of a given node i in an SIR spreading process we
study the average size of the population M_i infected in an epidemic originating at node i
with a given (ks, k). The infected population is averaged over all the origins with the same
(ks, k) values:

    M(k_s, k) = \sum_{i \in \Upsilon(k_s, k)} \frac{M_i}{N(k_s, k)},    (1)

where Υ(ks, k) is the union of all N(ks, k) nodes with (ks, k) values.
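
As an illustration of this averaging step, the following minimal Python sketch groups per-node outbreak sizes by their (ks, k) pair and takes the mean within each group. The function name and the three input dictionaries are hypothetical; the sketch assumes the per-node outbreak sizes M_i, the shell indices and the degrees have already been computed elsewhere.

```python
from collections import defaultdict


def average_by_shell_and_degree(M, ks, k):
    """Average the outbreak size M[i] over all origins i sharing the same (ks, k).

    M, ks and k are dictionaries keyed by node (hypothetical inputs).
    Returns a dictionary {(ks, k): mean outbreak size}.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for node, outbreak in M.items():
        key = (ks[node], k[node])
        totals[key] += outbreak   # running sum of M_i in this (ks, k) class
        counts[key] += 1          # N(ks, k): number of nodes in the class
    return {key: totals[key] / counts[key] for key in totals}


# Toy input: two nodes share (ks, k) = (2, 3); one node is alone in (1, 5).
M = {"a": 10.0, "b": 6.0, "c": 1.0}
ks = {"a": 2, "b": 2, "c": 1}
k = {"a": 3, "b": 3, "c": 5}
averages = average_by_shell_and_degree(M, ks, k)
```

In this toy input the class (2, 3) averages the outbreak sizes of nodes "a" and "b", while the class (1, 5) contains only node "c".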

The analysis of M(ks, k) in the studied social networks reveals three general results (see
Fig. 2): (a) For a fixed degree, there is a wide spread of M(ks, k) values. In particular,
there are many hubs located in the periphery of the network (large k, low ks) that are
poor spreaders. (b) For a fixed ks, M(ks, k) is approximately independent of the degree of
the nodes. This result is revealed in the vertically layered structure of M(ks, k), suggesting
that infected nodes located in the same k-shell produce similar epidemic outbreaks M(ks, k)
independent of the value of k of the infection origin. (c) The most efficient spreaders are
located in the inner core of the network (large ks region), fairly independently of their degree.
These results indicate that the k-shell index of a node is a better predictor of spreading
influence. When an outbreak starts in the core of the network (large ks) there exist many
pathways through which a virus can infect the rest of the network; this result is valid
regardless of the node degree. The existence of these pathways implies that during a typical
epidemic outbreak from a random origin, nodes located in high ks layers are more likely
to be infected and they will be infected earlier than other nodes (see SI-Section III). The
neighborhood of these nodes makes them more efficient in sustaining an infection at the
early stages, thus allowing the epidemic to reach a critical mass that will allow it to fully
develop. Similar results on the efficiency of high-ks nodes are obtained from the analysis
of M(ks, CB) in Fig. 2, where CB is the betweenness centrality of a node in the network
[9, 10]: the value of CB is not a good predictor for spreading efficiency.

To quantify the importance of ks in spreading we calculate the "imprecision functions"
ε_ks(p), ε_k(p), and ε_CB(p). These functions estimate, for each of the three indicators ks, k,
and CB, how close to the optimal spreading is the average spreading of the pN (0 < p < 1)
chosen origins in each case (see Methods and SI-Section IV). The strategy to predict the
spreading efficiency of a node based on ks is consistently more accurate than a method based
on k in the studied p-range (Fig. 3a). The CB-based strategy gives poor results compared
to the other two strategies.

Our finding is not specific to the social networks shown in Fig. 2. In SI-Section V we
analyze the spreading efficiency in other networks not social in origin, like the Internet at the
router level [20], with similar conclusions. The key insight of our finding is that in the studied
networks a large number of hubs are located in the peripheral low ks layers (Fig. 3b shows
the location of the 25 largest hubs in the CNI, see also SI-Section V) and therefore contribute
poorly to spreading. The existence of hubs in the periphery is a consequence of the rich
topological structure of real networks. In contrast, in a fully random network obtained by
randomly rewiring a real network preserving the degree of each node (such a random network
corresponds to the configuration model [21], see SI-Section VI) all the hubs are placed in
the core of the network (see the red scatter plot in Fig. 3c) and they contribute equally
largely to spreading. In such a randomized structure the same information is contained in
the k-shell as in the degree classification, since there is a one-to-one relation between both
quantities which is approximately linear, ks ∝ k (Fig. 3c and SI-Fig. [TBI). Examples of
real networks that are similar to a random structure are the network of product space of
economic goods [22] and the Internet at the AS level (analyzed in SI-Section V).
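
The degree-preserving randomization mentioned above can be sketched with repeated double-edge swaps. The text does not state which rewiring procedure was used, so the swap scheme, the function name `degree_preserving_rewire` and its parameters are assumptions made for illustration only. The input is assumed to be the edge list of a simple undirected graph (no self-loops, no duplicate edges).

```python
import random


def degree_preserving_rewire(edges, n_swaps, seed=0):
    """Randomize a simple undirected graph while keeping every node's degree.

    Each accepted swap replaces edges (a, b) and (c, d) with (a, d) and (c, b).
    Swaps that would create a self-loop or a duplicate edge are rejected.
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:
            c, d = d, c  # pick one of the two possible pairings at random
        if len({a, b, c, d}) < 4:
            continue  # shared endpoint: the swap would create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # the swap would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


def degree_sequence(edge_list):
    """Return {node: degree} for an undirected edge list."""
    deg = {}
    for u, v in edge_list:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


# Toy graph: a ring of 10 nodes with two chords.
original = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5), (2, 7)]
rewired = degree_preserving_rewire(original, n_swaps=20, seed=1)
```

Every node keeps its degree because each swap removes one link and adds one link at each of the four endpoints; only the identity of the neighbors changes.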

Our study highlights the importance of the relative location of a single spreading origin.
Next, we address the question of the extent of an epidemic that starts in multiple origins
simultaneously. Figure 3d shows the extent of SIR spreading in the CNI network when the
outbreak simultaneously starts from the n nodes with the highest degree k or the highest
ks index. Even though the high ks nodes are the best single spreaders, in the case of
multiple spreading the nodes with highest degree are more efficient than those with highest
ks. This result is attributed to the overlap of the infected areas of the different spreaders:
large ks nodes tend to be clustered close to each other, while hubs can be more spread out in
the network and, in particular, they need not be connected with each other. Clearly, the
step-like features in the plot of highest ks nodes (red solid curve in Fig. 3d) suggest that the
infected percentage remains constant as long as the infected nodes belong to the same shell.
Including just one node from a different shell results in a significantly increased spreading.
This result suggests that a better spreading strategy using n multiple spreaders is to choose
either the highest k or ks nodes with the requirement that no two of the n spreaders are
directly linked to each other. This scheme then provides the largest infected area of the
network, as shown in Fig. 3d.
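
One possible way to select n spreaders such that no two of them are directly linked is a greedy pass over the nodes ranked by score (degree or ks). The greedy rule, the tie-breaking by node label and the function name are assumptions of this sketch; the text only states the requirement that the chosen origins are not neighbors.

```python
def pick_nonadjacent_spreaders(adj, score, n):
    """Greedily pick up to n high-score nodes, no two of which share a link.

    adj maps each node to the set of its neighbors; score maps each node to
    its degree or ks value (hypothetical inputs).
    """
    chosen = []
    # Highest score first; ties are broken by node label for reproducibility.
    for node in sorted(adj, key=lambda v: (-score[v], str(v))):
        if all(node not in adj[c] for c in chosen):
            chosen.append(node)
            if len(chosen) == n:
                break
    return chosen


# Toy path graph a - b - c - d, ranked by degree.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
degree = {v: len(nb) for v, nb in adj.items()}
origins = pick_nonadjacent_spreaders(adj, degree, n=2)
```

On this path the two highest-degree nodes, "b" and "c", are neighbors, so the sketch keeps "b" and then skips ahead to "d", the next candidate not linked to "b".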

Many contagious infections, including most sexually transmitted infections [23], do not
confer full immunity after infection as assumed in the SIR model, and therefore are suitably
described by the SIS epidemic model, where an infectious node returns to the susceptible
state with probability λ. In an SIS epidemic the number of infectious nodes eventually
reaches a dynamic equilibrium "endemic" state where as many infectious individuals become
susceptible as susceptible nodes become infectious [18]. In contrast to SIR, in the initial
state of our SIS simulations 20% of the network nodes are already infected. The spreading
efficiency of a given node i in SIS spreading is the persistence, ρ_i(t), defined as the probability
that node i is infected at time t [7]. In an endemic SIS state, ρ_i(t → ∞) becomes independent
of t (see SI-Section VII). Previous studies have shown that the largest persistence ρ_i(t → ∞)
is found in the network hubs, which are re-infected frequently due to the large number of
neighbors. However, we find that this result holds only in randomized network
structures. In the real network topologies studied here, we find that viruses persist mainly
in high ks layers instead, irrespective of the degree of the nodes in the core.
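
A minimal sketch of how a per-node persistence could be estimated from an SIS simulation is given below. The synchronous update, the order of infection and recovery within a step, the burn-in length and the function name `sis_persistence` are assumptions of this sketch and are not taken from the paper; the 20% initial infection follows the description above.

```python
import random


def sis_persistence(adj, beta, lam, steps=2000, burn_in=1000,
                    init_frac=0.2, seed=0):
    """Estimate the fraction of late time steps each node spends infected.

    adj maps each node to the set of its neighbors. At every step each
    infected node tries to infect each susceptible neighbor with probability
    beta, then returns to the susceptible state with probability lam.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    n_init = max(1, int(init_frac * len(nodes)))
    infected = set(rng.sample(nodes, n_init))
    counts = dict.fromkeys(nodes, 0)
    for t in range(steps):
        newly = set()
        for node in infected:
            for nb in adj[node]:
                if nb not in infected and rng.random() < beta:
                    newly.add(nb)
        # Nodes that do not recover in this step remain infectious.
        staying = {node for node in infected if rng.random() >= lam}
        infected = staying | newly
        if t >= burn_in:
            for node in infected:
                counts[node] += 1
    return {node: counts[node] / (steps - burn_in) for node in nodes}


# Toy path graph used for two limiting cases.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
never = sis_persistence(path, beta=0.0, lam=1.0, steps=20, burn_in=10)
always = sis_persistence(path, beta=1.0, lam=0.0, steps=20, burn_in=10)
```

The two limiting cases serve as sanity checks: with no transmission and certain recovery the infection dies out, and with certain transmission and no recovery every node of a connected graph stays infected.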

In the case of random networks, it is found that viruses propagate to the entire network
above an epidemic threshold given by β > β_c^rand = λ⟨k⟩/⟨k²⟩ [24, 26]. In real networks, such
as the CNI network, the threshold β_c is different from β_c^rand. Furthermore, in real networks,
we find that viruses can survive locally even when β < β_c, but only within the high ks
layers of the network, while virus persistence in peripheral ks layers is negligible (Fig. 4a-c).
Since the k-shell structure depends on the network assortativity, the lower threshold is
in agreement with the observation that high positive assortativity [23] may decrease the
epidemic threshold.
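
As a worked check of the random-network threshold, read here as β_c^rand = λ⟨k⟩/⟨k²⟩ in line with the caption of Table I in the Supplementary Information, the short Python sketch below evaluates it for the CNI values listed in that table (⟨k⟩ = 35.1, ⟨k²⟩ = 1633, λ = 0.8). The function names are hypothetical.

```python
def random_network_threshold(lam, mean_k, mean_k2):
    """Threshold lam * <k> / <k^2> for a degree-matched random network."""
    return lam * mean_k / mean_k2


def degree_moments(degrees):
    """Return (<k>, <k^2>) for a list of node degrees."""
    n = len(degrees)
    return sum(degrees) / n, sum(d * d for d in degrees) / n


# CNI values as listed in Table I of the Supplementary Information.
cni_threshold = random_network_threshold(0.8, 35.1, 1633)
```

The result is about 0.0172, in agreement with the 1.7% listed for the CNI in Table I.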

The importance of high ks nodes in SIS spreading is confirmed when we analyze the
asymptotic probability that nodes of given (ks, k) values will be infected. This probability
is quantified by the persistence function

    \rho(k_s, k) = \sum_{i \in \Upsilon(k_s, k)} \frac{\rho_i}{N(k_s, k)}    (2)

as a function of (ks, k) at different β values (Fig. 4a and b). High ks layers in networks
might be closely related to the concept of a core group in Sexually Transmitted Infections
research [23]. The core groups are defined as subgroups in the general population characterized
by high partner turnover rate and extensive intergroup interaction [23].

Similar to the core group, the dense sub-network formed by nodes in the innermost
k-shells helps the virus to consistently survive locally in the inner-core area and infect other
nodes adjacent to the area. These k-shells preserve the existence of a virus, in contrast to
e.g. isolated hubs in the periphery. Note that a virus cannot survive in the degree-preserving
randomized version of the CNI network, due to the absence of high k-shells.

The importance of the inner-core nodes in spreading is not influenced by the infection
probability values, β. In both models, SIS and SIR, we find that the persistence ρ or the
average infected fraction M, respectively, is systematically larger for nodes in inner k-shells
compared to nodes in outer shells, over the entire β range that we studied (Fig. 4c,d). Thus,
the k-shell measure is a robust indicator for the spreading efficiency of a node.

Finding the most accurate ranking of individual nodes for spreading in a population can
influence the success of dissemination strategies. When spreading starts from a single node,
the ks value is enough for this ranking, while in the case of many simultaneous origins,
spreading is greatly enhanced when we additionally repel the spreaders with large degree or
ks. In the case of infections that do not confer immunity on recovered individuals, the core
of the network in the large ks layers forms a reservoir where infection can survive locally.



I. METHODS 



A. The k-shell decomposition

Nodes are assigned to k-shells according to their remaining degree, which is obtained
by successive pruning of nodes with degree smaller than the ks value of the current layer.
We start by removing all nodes with degree k = 1. After removing all the nodes with
k = 1, some nodes may be left with one link, so we continue pruning the system iteratively
until there is no node left with k = 1 in the network. The removed nodes, along with the
corresponding links, form a k-shell with index ks = 1. In a similar fashion, we iteratively
remove the next k-shell, ks = 2, and continue removing higher k-shells until all nodes are
removed. As a result, each node is associated with a unique ks index, and the network can
be viewed as the union of all k-shells. The resulting classification of a node can be very
different than when the degree k is used.
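
The pruning procedure described above can be sketched in a few lines of Python. This is a minimal illustration that favors readability over speed; the function name, the edge-list input format and the toy network are assumptions of the sketch, not part of the original analysis.

```python
from collections import defaultdict


def k_shell_decomposition(edges):
    """Assign a shell index ks to every node by iterative pruning.

    edges is an iterable of (u, v) pairs of an undirected graph.
    Returns a dictionary {node: ks}.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    degree = {node: len(nb) for node, nb in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 1
    while remaining:
        # Keep pruning until no remaining node has degree <= k.
        to_prune = [n for n in remaining if degree[n] <= k]
        while to_prune:
            for node in to_prune:
                remaining.discard(node)
                shell[node] = k
                for nb in adj[node]:
                    if nb in remaining:
                        degree[nb] -= 1  # the neighbor loses one link
            to_prune = [n for n in remaining if degree[n] <= k]
        k += 1
    return shell


# Toy network: a triangle (a, b, c) plus a peripheral hub h with four leaves,
# attached to the triangle through node d.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "d"), ("d", "h"),
             ("h", "l1"), ("h", "l2"), ("h", "l3"), ("h", "l4")]
toy_shells = k_shell_decomposition(toy_edges)
```

In the toy network the hub h has degree 5 but ends up in the ks = 1 shell, whereas the triangle nodes, with lower or equal degree, form the ks = 2 core. This mirrors the situation of a hub located in the periphery.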

B. The spreading models 

To study the spreading process we apply the Susceptible-Infectious-Recovered (SIR) and
Susceptible-Infectious-Susceptible (SIS) models. In the SIR model, all nodes are initially
in the susceptible state (S) except for one node in the infectious state (I). At each time step,
the I nodes attempt to infect their susceptible neighbors with probability β and then enter
the recovered state (R), where they become immunized and cannot be infected again. The
SIS model aims to describe spreading processes that do not confer immunity on recovered
individuals: infected individuals still try to infect their neighbors with probability β but
they return to the susceptible state with probability λ (here we use λ = 0.8) and can be
reinfected at subsequent time steps, while they remain infectious with probability 1 − λ.
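
The SIR dynamics described above can be sketched as follows. Counting the origin in the outbreak size, the number of realizations and the function names are assumptions of this illustration rather than details taken from the text.

```python
import random


def sir_outbreak_size(adj, origin, beta, rng):
    """Run one SIR realization and return the number of nodes ever infected.

    adj maps each node to the set of its neighbors. Every infectious node
    tries once to infect each susceptible neighbor with probability beta and
    then recovers. The origin is counted in the result (an assumption).
    """
    infected = {origin}
    recovered = set()
    while infected:
        newly = set()
        for node in infected:
            for nb in adj[node]:
                susceptible = (nb not in infected and nb not in recovered
                               and nb not in newly)
                if susceptible and rng.random() < beta:
                    newly.add(nb)
        recovered |= infected  # infectious nodes recover after one step
        infected = newly
    return len(recovered)


def mean_outbreak_size(adj, origin, beta, runs=1000, seed=0):
    """Average the outbreak size over many independent realizations."""
    rng = random.Random(seed)
    total = sum(sir_outbreak_size(adj, origin, beta, rng) for _ in range(runs))
    return total / runs


# Toy path graph used for two limiting cases.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
size_no_spread = mean_outbreak_size(path, origin=0, beta=0.0, runs=10)
size_full_spread = mean_outbreak_size(path, origin=0, beta=1.0, runs=10)
```

With β = 0 only the origin is ever infected, and with β = 1 the outbreak covers the whole connected component, which gives two simple checks of the update rule.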

C. The imprecision function 

The betweenness centrality, CB(i), of a node i is defined as follows: Consider two nodes
s and t and the set σ_st of all possible shortest paths between these two nodes. If the subset
of this set that contains the paths that pass through the node i is denoted by σ_st(i), then
the betweenness centrality of this node is given by:

    C_B(i) = \sum_{s \neq t} \frac{\sigma_{st}(i)}{\sigma_{st}},    (3)

where the sum runs over all nodes s and t in the network.

The imprecision function ε(p) quantifies the difference in the average spreading between
the pN nodes (0 < p < 1) with highest ks, k, or CB and the average spreading of the pN
most efficient spreaders (N is the number of nodes in the network). Thus, it tests the merit
of using k-shell, k and CB to identify the most efficient spreaders. For a given β value and
a given fraction of the system p we first identify the set of the Np most efficient spreaders
as measured by M_i (we designate this set by Υ_eff). Similarly, we identify the Np individuals
with the highest k-shell index (Υ_ks). We define the imprecision of k-shell identification as
ε_ks(p) = 1 − M_ks/M_eff, where M_ks and M_eff are the average infected percentages averaged
over the Υ_ks and Υ_eff groups of nodes respectively. ε_k and ε_CB are defined similarly to ε_ks.
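
The imprecision function can be computed directly from its definition once the per-node outbreak sizes are known. In the minimal Python sketch below, the function name, the rounding of pN to an integer and the handling of ties in the ranking are assumptions; the outbreak sizes are assumed to be positive.

```python
def imprecision(M, score, p):
    """Return 1 - M_score / M_eff for the top fraction p of the nodes.

    M maps each node to its average outbreak size; score maps each node to
    the indicator being tested (ks, k or CB). Ties keep the input order.
    """
    nodes = list(M)
    n_top = max(1, int(p * len(nodes)))
    top_by_score = sorted(nodes, key=lambda v: score[v], reverse=True)[:n_top]
    top_by_spread = sorted(nodes, key=lambda v: M[v], reverse=True)[:n_top]
    m_score = sum(M[v] for v in top_by_score) / n_top
    m_eff = sum(M[v] for v in top_by_spread) / n_top
    return 1.0 - m_score / m_eff


# Toy data: the ks ranking picks the true best spreaders, the degree ranking
# picks the two worst ones.
M = {"a": 10.0, "b": 8.0, "c": 2.0, "d": 1.0}
ks = {"a": 3, "b": 3, "c": 1, "d": 1}
k = {"a": 1, "b": 2, "c": 5, "d": 4}
eps_ks = imprecision(M, ks, p=0.5)
eps_k = imprecision(M, k, p=0.5)
```

A value of 0 means the indicator selects origins that spread as well as the best possible set, and values close to 1 mean the selected origins spread far less than the optimal ones.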
FIG. 1: When the hubs may not be good spreaders. a, A schematic representation
of a network under the k-shell decomposition. The two nodes of degree k = 8 (blue and
yellow nodes) in this network are in different locations: one lies in the periphery (ks = 1),
while the other hub is in the innermost core of the network, i.e. it has the largest ks (ks = 3).
b-d, The extent of the efficiency of the spreading process cannot be accurately predicted
based on a measure of the immediate neighborhood of the node, such as the degree k. For
the contact network of inpatients (CNI), we compare infections originating from single nodes
having the same degree k = 96 (nodes A and B) or the same index ks = 63 (nodes A and
C), with infection probability β = 0.035. In the corresponding plots, the colors indicate the
probability that a node will be infected when spreading starts in the corresponding origin,
as long as this probability is higher than 25%. The results are based on 10000 different
realizations for each case. In the first case, where origin A has ks = 63, spreading reaches
a much wider area more frequently, in contrast to origin B (ks = 26), where the infection
remains largely localized in the immediate neighborhood of B. Spreading is very similar
between origins A and C, which have the same ks value, although the degree of C is much
smaller than that of A. The importance of the network organization is also highlighted when we
randomly rewire the network (preserving the same degree for all nodes). In this case the
standard picture is recovered: the extent of spreading coincides and both hubs contribute
equally largely to spreading (see SI-Section VI).

FIG. 2: The k-shell index predicts the outcome of spreading more reliably
than the degree k or the betweenness centrality CB. The networks used are (top to
bottom): email contacts (β = 8%), CNI network (β = 4%), the actors network (β = 1%),
and the LiveJournal.com friendship network (β = 1.5%). a, c, e, g, Average infected size of
the population M(ks, k) when spreading originates in nodes with (ks, k). b, d, f, h, The
infected size M(ks, CB) when spreading originates in nodes of a given combination of ks
and CB. In both cases, spreading is larger for nodes of higher ks, while nodes of a given
k or CB value can result in either small or large spreading, depending on the value of ks.
(There is an exception at large ks and small k in the LiveJournal database, which is due to
artificial closed groups of virtual characters that connect with each other for the purpose of
online gaming and do not correspond to regular users, unlike the rest of the database.)

FIG. 3: k-shell structure of the CNI network. a, The imprecision functions ε_ks(p),
ε_k(p), and ε_CB(p), for β = 4%. Even though both k-shell and k identification strategies
yield comparable results for p = 2%, the k-shell strategy is consistently more accurate for
2% < p < 10%, with ε_ks approximately half of ε_k. The CB identification of the
most efficient spreaders is the least accurate, with ε_CB exceeding 40%. b, We visualize
the CNI network as a set of concentric circles of nodes representing inpatients, each circle
corresponding to a particular k-shell. The ks indices of a given layer increase as one moves
from the periphery to the center of the network [28, 29]. Node size is proportional to the
logarithm of the degree of the node. We highlight the 25 inpatients with the largest degree
values. Note that inpatients with high k values are not concentrated at the "center" of the
network but instead are scattered throughout different k-shells. We highlight the position of
the three nodes A, B, and C, the origins that were used in the example of Fig. 1. c, Scatter
plot of the node degree k as a function of ks for all the nodes in the CNI network (black
symbols) and the degree-preserving randomized version of the same network (red symbols).
Note that there are many inpatients with large k and low ks values in the original network,
while in the randomized network all the hubs are located in the inner core of the
network. We also show the position of the three origins used in Fig. 1. d, When spreading
starts from multiple origins, the set of nodes with highest degree (blue continuous line) can
spread significantly more than the set of highest-ks nodes (red continuous line), because in
the latter case most of these nodes are connected to each other. If we only consider in this
set nodes that are not directly linked, then both the sets of highest k or ks nodes yield a
similar result (dashed lines), where spreading is significantly enhanced. Results are shown
for β = 3% in the CNI.

FIG. 4: SIS spreading in the CNI network and β dependence for SIS and
SIR. a, b, Virus persistence ρ(ks, k) as a function of k and ks values of inpatients in the
CNI network for β = 2% and β = 4%, respectively, where 20% of the individuals are
initially infected. The infection survives mainly in nodes with large ks values. c, We form
four groups of nodes of the CNI network based on their k-shell values. For all values of β,
virus persistence is consistently higher in the inner k-shells. d, Influence of the infection
probability β on the spreading efficiency of nodes, grouped according to their k-shell values,
for SIR spreading. The solid black line refers to the average infected percentage over all
network nodes. Nodes in higher k-shells are consistently the most efficient, independently
of the β value.

Identifying influential spreaders in complex networks

SUPPLEMENTARY INFORMATION 



I. DATASETS 

In this study we have mainly focused on social networks, but our results can be extended 
to networks from practically any discipline. The datasets that were used in the paper and 
in this Supplementary Information are the following: 



a) Contact Network of Inpatients. We use records from Swedish hospitals [16] and establish
a link between two inpatients if they have both been hospitalized in the same quarters.
We restrict the recording period to one week. All the data have been handled in a
de-identified form. There are 8622 inpatients in the largest component, with an average
degree of around 35.1.

b) IMDB actors in adult films. We have created a network of connections between actors
who have co-starred in films whose genre has been labeled by the Internet Movie Database
[17] as 'adult'. This network is a largely isolated sub-set of the original actor collaboration
network. Additionally, all these films have been produced during the last few decades,
rendering the network more focused in time. The largest component comprises 47719
actors/actresses in 39397 films. The average degree of the network is 46.0.

c) Email Contact Network. The network of email contacts is based on email messages
sent and received at the Computer Sciences Department of University College London. The
data have been collected in the time window between December 2006 and May 2007. Nodes
in the network represent email accounts. We connect two email accounts with an undirected
link in the case where at least two emails have been exchanged between the accounts (at
least one email in each direction). There are 12701 nodes with an average degree of 3.2.

d) LiveJournal.com. The network of friends in the LiveJournal community, as recorded
in a 2008 snapshot. We only consider reciprocal links, i.e. when two members are in each
other's list of friends. There are 3453394 nodes in the largest component, and the average
degree is 12.4.

e) Cond-mat collaboration network. This is the network of collaborations between scientists
that have posted preprints in the 'cond-mat' e-print archive, between 1995 and 2005. The
nodes of the network represent the authors, who are connected if they have co-authored at
least one paper. The cond-mat collaboration dataset consists of 17628 authors with average
degree 6.0.

f) The Internet at the router level (RL). The nodes of the RL Internet network are the
Internet routers. Two routers are connected if there exists a physical connection between
them. Data have been gathered from the DIMES project [13]. The largest connected
component of the analyzed dataset contains 493312 routers with an average degree of 3.3.

g) The Internet at the autonomous system level (AS). The nodes are autonomous systems
which are connected if there exists a physical connection between them. An autonomous
system is a collection of connected IP routing prefixes under the control of one or more
network operators that presents a common, clearly defined routing policy to the Internet.
Data have been gathered by the DIMES project [13]. The largest connected component of
the AS Internet consists of 20556 autonomous systems with average degree 6.1.

h) Product space of economic goods. This is the network of proximity between products
according to Ref. [22]. We use a proximity threshold 0.3, and we recover similar results for
different thresholds, as well.

We outline some of the basic properties for these networks in Table I.

Network                          N         N_E        <k>     <k^2>    β_c^rand   β      ks max
Contact Network of Inpatients    8622      151649     35.1    1633     1.7%       4%     66
Actor Network                    47719     1028537    46.0    17483    0.21%      1%     199
Email Contacts                   12701     20417      3.2     351.1    0.73%      8%     23
LiveJournal                      3453394   21378154   12.38   892.45   1.1%       1.5%   100
Cond-mat Collaboration Network   17628     52884      7.0     109.4    5.1%       10%    22
RL Internet                      493312    808844     3.3     71.9     4.6%       6%     36
AS Internet                      20556     62920      6.1     2111.2   0.23%      n/a    41
Product Space                    765       40164      104.8   16931    0.50%      n/a    100

TABLE I: Properties of the real-world networks studied in this work. Here N is the number of
nodes, N_E is the number of edges, <k> is the average degree in the network, <k^2> is the
average squared degree in the network, β_c^rand is the epidemic threshold for a corresponding random
network (β_c^rand ≈ λ<k>/<k^2>), λ = 0.8 in SIS simulations, β is the value we used in SIR
simulations and ks max is the highest k-shell index of the network. We consider only the largest
connected cluster of the network if the original network is disconnected.



II. THE k-SHELL DECOMPOSITION METHOD

In order to classify the nodes into k-shells we employ the k-shell decomposition algorithm.
First, we remove all nodes with degree k = 1. After this first stage of pruning there may
appear new nodes with k = 1. We keep on pruning these nodes, as well, until all nodes with
degree k = 1 are removed. The removed nodes along with the links connecting them form the
ks = 1 k-shell. Next, we repeat the pruning process in a similar way for the nodes of degree
k = 2 to extract the ks = 2 k-shell, and subsequently for higher values of k until all nodes are
removed. As a result, the network can be viewed as a set of adjacent k-shells (see Fig. 5).




FIG. 5: The illustration of the k-shell extraction method. a, A schematic network is
represented as a set of 3 successively enclosed k-shells labeled accordingly. b, Nodes with edges
forming the ks = 1 shell of the network. c, Nodes with edges forming the ks = 2 shell of the network. d,
Nodes with edges forming the ks = 3 shell of the network.



The k-shell decomposition method assigns a unique ks value to each node, which corresponds
to the index of the k-shell this node belongs to. The ks index provides a different
type of information on a node than that provided by the degree k. By definition, a given
layer with index ks can be occupied by nodes of degree k ≥ ks. In the case of random
model networks, such as the configuration model, there is a strong correlation between k
and the ks index of a node and, therefore, both quantities provide the same type of information.
Thus, the low-degree nodes are generally in the periphery, and the high-degree nodes
are generally in the innermost k-shells. In real networks, however, this relation is often not
true. In real networks hubs may have very different ks values and can be located both in
the periphery (yellow node in Fig. 5a) or in the core (blue node in Fig. 5a) of the network.
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FIG. 6: Robustness of ks under incomplete network information. We randomly remove 
10% of the network links and 50% of the network links (results shown in black and red symbols, 
respectively). The relative ranking of the nodes remains invariant under both removals, for all the 
networks studied: Email, Hospital, Adult IMDB, and Livejournal.com. 

The assignment of a ks index to a node is also quite robust. We have randomly removed
10% and 50% of the links in the networks that we study, thus simulating incomplete information.
When we measure the new ks value for the same nodes in the resulting networks
(Fig. 6) we find that their relative ranking remains the same. We recover a practically linear
dependence between the ks values of the original and the incomplete networks, showing that this
measure would work equally well for predicting the spreading efficiency of nodes in a network
with missing information.
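A robustness check of this kind can be sketched as follows. The helper names (`core_numbers`, `ks_under_link_removal`, `mean_new_ks_by_original`) are hypothetical, and two choices are assumptions of this sketch rather than details given in the text: links are removed uniformly at random without replacement, and a node that loses all of its links is assigned ks = 0.

```python
import random
from collections import defaultdict


def core_numbers(nodes, edges):
    """ks index of every node; isolated nodes get ks = 0 (an assumption)."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    degree = {n: len(adj[n]) for n in adj}
    remaining = set(adj)
    ks = {}
    k = 0
    while remaining:
        stack = [n for n in remaining if degree[n] <= k]
        while stack:
            n = stack.pop()
            if n not in remaining:
                continue
            remaining.discard(n)
            ks[n] = k
            for m in adj[n]:
                if m in remaining:
                    degree[m] -= 1
                    if degree[m] <= k:
                        stack.append(m)
        k += 1
    return ks


def ks_under_link_removal(edges, fraction, seed=0):
    """Return (original ks, ks after removing `fraction` of the links)."""
    rng = random.Random(seed)
    edges = list(edges)
    nodes = {n for edge in edges for n in edge}
    original = core_numbers(nodes, edges)
    n_keep = len(edges) - int(round(fraction * len(edges)))
    kept = rng.sample(edges, n_keep)  # uniform sample without replacement
    damaged = core_numbers(nodes, kept)
    return original, damaged


def mean_new_ks_by_original(original, damaged):
    """Average ks in the incomplete network, grouped by the original ks."""
    groups = defaultdict(list)
    for node, k in original.items():
        groups[k].append(damaged[node])
    return {k: sum(v) / len(v) for k, v in sorted(groups.items())}
```

Plotting the output of `mean_new_ks_by_original` against the original ks gives the kind of curve that Fig. 6 summarizes. Removing links can only lower or preserve a node's ks, so the quantity of interest is whether the relative ordering of the nodes survives.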

III. PROBABILITY AND TIME OF INFECTION 

We have demonstrated that the location of a node, as described through the ks index, is
important for the extent of spreading Mi when this node is the spreading origin. Here, we
show that nodes with high ks are more likely to be infected during an epidemic outbreak,
and are infected earlier, than nodes with low ks, when spreading starts at a random node.
We introduce the quantity Ei as the probability that a node i is going to be infected during
an epidemic outbreak originating at a random location, and Ti as the average time before
node i is infected during the same process.

As shown in Figs. 7a-d, all three quantities that characterize the role of a node in an
epidemic process, Mi, Ei and Ti, are strongly correlated. The nodes that are infected by a
given node i form a cluster of size Mi, and they are statistically the nodes that can reach i
when they act as origins themselves. Thus, the probability Ei to reach this node in general
is directly proportional to the size Mi, as shown in the plots. The average time to reach a
node is inversely proportional to its spreading efficiency Mi, which emphasizes the fact that
these nodes are easily reachable from different network locations. In conclusion, the nodes
with the largest ks values consistently a) infect larger parts of the network, b) are
infected more frequently, and c) are infected earlier than nodes with smaller ks values.
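To make the three quantities concrete, the sketch below estimates Mi, Ei and Ti by direct simulation. It rests on assumptions that are not spelled out in this section: a discrete-time SIR process in which every infected node tries once to infect each susceptible neighbor with probability β and then recovers, every node used as an origin the same number of times (as a stand-in for a random origin), and Mi reported as a fraction of the network that includes the origin itself. The function names are hypothetical.

```python
import random


def sir_outbreak(adj, origin, beta, rng):
    """One SIR realization; returns {node: time step at which it was infected}."""
    infection_time = {origin: 0}
    infected = [origin]
    t = 0
    while infected:
        t += 1
        newly_infected = []
        for node in infected:
            for nbr in adj[node]:
                # Only nodes never infected before are still susceptible.
                if nbr not in infection_time and rng.random() < beta:
                    infection_time[nbr] = t
                    newly_infected.append(nbr)
        infected = newly_infected  # the previous generation recovers
    return infection_time


def spreading_statistics(adj, beta, runs_per_origin=100, seed=0):
    """Estimate M_i, E_i and T_i for every node of the graph `adj`."""
    rng = random.Random(seed)
    nodes = list(adj)
    n = len(nodes)
    M = {}
    hits = dict.fromkeys(nodes, 0)       # times node i was infected by others
    time_sum = dict.fromkeys(nodes, 0)   # sum of its infection times
    for origin in nodes:
        total_size = 0
        for _ in range(runs_per_origin):
            times = sir_outbreak(adj, origin, beta, rng)
            total_size += len(times)
            for node, t in times.items():
                if node != origin:
                    hits[node] += 1
                    time_sum[node] += t
        M[origin] = total_size / (runs_per_origin * n)
    outbreaks_elsewhere = runs_per_origin * (n - 1)
    E = {v: hits[v] / outbreaks_elsewhere for v in nodes}
    T = {v: time_sum[v] / hits[v] if hits[v] else float("nan") for v in nodes}
    return M, E, T
```

Here `adj` is a dictionary mapping each node to an iterable of its neighbors. Grouping the returned values by the ks index of each node gives the comparison discussed in this section.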

IV. THE IMPRECISION FUNCTIONS 

FIG. 7: Cross-plots of Mi as a function of Ti, and Mi as a function of Ei (inset) for a) email, b)
hospital inpatients, c) actor network and d) RL Internet. Every point denotes the corresponding
quantities for a given node, and the color denotes the k-shell index of this node. The ks values
are aggregated and highlighted with red (large ks regime), green (intermediate ks regime) and
blue (low ks values) colors, respectively. A high level of correlation between Mi and Ei indicates
that the most efficient spreaders (as measured by Mi) are the most likely to be infected during
an epidemic outbreak originating at a random inpatient in the network. On the other hand, the
anti-correlation between Mi and Ti indicates that the most efficient spreaders are typically infected
earlier than other nodes during an epidemic outbreak.

FIG. 8: The imprecision functions ε(p) test the merit of using k-shell, k, and CB to
identify the most efficient spreaders in the CNI, actor, collaboration, and email contact
networks. The k-shell based identification method yields consistently lower imprecision compared
to the k and CB based methods.

We quantify the spreading efficiency of an individual origin i through the number of infected
nodes Mi. In order to compare the different methods, we rank all network nodes according
to their spreading efficiency, independently of their other properties, and we consider a
fraction p of the most efficient spreaders (p ∈ [0,1]). We designate this set by Υeff(p).
Similarly, we define Υks(p) as the set of individuals with the highest k-shell values. In order to
assess the merit of using k-shell decomposition to identify the most efficient SIR spreaders
one needs to compare the two sets Υeff(p) and Υks(p). In order to consider individual Mi
values, we calculate the average Meff(p) and Mks(p) values for the sets Υeff(p) and Υks(p),
respectively:

$$M_{k_s}(p) = \sum_{i \in \Upsilon_{k_s}(p)} \frac{M_i}{Np}, \qquad M_{\mathrm{eff}}(p) = \sum_{i \in \Upsilon_{\mathrm{eff}}(p)} \frac{M_i}{Np},$$

where Np is the number of nodes that we consider in the comparison. By definition, Meff(p) ≥ Mks(p),
and the equality is only reached if Υeff(p) = Υks(p). We assess the imprecision of k-shell
identification by calculating the ratio between Meff(p) and Mks(p):



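$$\epsilon_{k_s}(p) = 1 - \frac{M_{k_s}(p)}{M_{\mathrm{eff}}(p)}. \tag{4}$$

Similarly, we can define εk(p) and εCB(p):

$$\epsilon_{k}(p) = 1 - \frac{M_{k}(p)}{M_{\mathrm{eff}}(p)}, \qquad \epsilon_{C_B}(p) = 1 - \frac{M_{C_B}(p)}{M_{\mathrm{eff}}(p)}. \tag{5}$$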



A value of ε close to 0 denotes a very efficient process, since the nodes that are chosen are
practically those that contribute most to epidemics. In all cases, the ks method yields a
spreading that is closer to the optimum than either the degree or the betweenness centrality.
Additionally, this behavior is independent of the fraction of spreaders p that we consider in
each case.
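As a worked illustration of the imprecision function, the sketch below evaluates ε(p) for an arbitrary node ranking. The function name `imprecision` and the toy numbers are hypothetical and are not data from the networks studied here. One choice is an assumption of the sketch: ties in the ranking score are broken by the order in which nodes appear in the input, a detail that matters for ks because many nodes share the same value.

```python
def imprecision(M, score, p):
    """epsilon(p) = 1 - M_score(p) / M_eff(p) for a given ranking score.

    M     : {node: spreading efficiency M_i}
    score : {node: ranking quantity, e.g. ks, k or betweenness}
    p     : fraction of nodes considered, 0 < p <= 1
    """
    n_top = max(1, int(round(p * len(M))))
    # The p*N most efficient spreaders, and the p*N highest-scoring nodes.
    top_efficient = sorted(M, key=M.get, reverse=True)[:n_top]
    top_scored = sorted(M, key=lambda node: score[node], reverse=True)[:n_top]
    m_eff = sum(M[node] for node in top_efficient) / n_top
    m_score = sum(M[node] for node in top_scored) / n_top
    return 1.0 - m_score / m_eff


# Hypothetical toy values: node c is a hub (large k) sitting in a low shell.
M_toy = {"a": 0.30, "b": 0.28, "c": 0.10, "d": 0.05}
ks_toy = {"a": 3, "b": 3, "c": 1, "d": 1}
k_toy = {"a": 2, "b": 3, "c": 10, "d": 1}

eps_ks = imprecision(M_toy, ks_toy, p=0.5)  # ks picks a and b: no imprecision
eps_k = imprecision(M_toy, k_toy, p=0.5)    # k picks c and b: about 0.34
```

In this toy case the degree ranking selects the peripheral hub c and therefore pays a penalty, whereas the ks ranking recovers the two most efficient spreaders exactly.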




FIG. 9: The shell index ks predicts the outcome of spreading more reliably than the
degree k or the betweenness centrality CB. The networks that were analyzed are: (a, b) the
RL Internet and (c, d) the collaboration network. a and c, The average infected size M(ks, k)
as a function of the (ks, k) values of the infection origin nodes. b and d, The average infected size
M(ks, CB) as a function of the (ks, CB) values of the infection origin nodes.



V. SIR SPREADING EFFICIENCY 

In the main text we present results for M(ks, k) for the email network, the CNI, the
actor network and the Livejournal network. Here, we present additional results of the k-shell
analysis of the Internet at the Router Level (RL) and the scientific collaboration network.
Figure 9 shows the results for M(ks, k) and M(ks, CB). The conclusion on the spreading
importance of high-ks nodes is exactly the same as for the social networks in the main text.

The results on node efficiency are not significantly influenced by the choice of the
infection probability β. In Fig. 10 we present the infected percentage M for different
networks, as an average over nodes that belong to the same ks range, for different β values.

FIG. 10: The infected percentage is always higher in higher k-shells, independently
of the infection probability β. Nodes are grouped according to their k-shell and we calculate
the average infected percentage for each group as a function of β. The solid lines correspond to
the grand average over all nodes acting as spreading origins. The networks that were analyzed
are: a, the email network, b, the CNI, c, the adult IMDB actors network, and d, the cond-mat
collaboration network.



The nodes in higher k-shells consistently reach a larger fraction of the network. Our
main interest is in the range where we are above the critical point, ⟨M⟩ > 0, but the
average infection reaches a finite but small fraction, in the range of 1-20%. When the
average spreading is even larger, nodes of lower k-shells can become efficient too, because in
this case there is a high probability of reaching the 'core' of the network, which enables
the spreading to extend over an even larger part of the network.

For β values in this 'intermediate' range, the distribution P(M) of the infected percentage
M is composed of two well-defined peaks (Fig. 11). The first is at M = 0 and corresponds
to those instances where the infection dies within the first few infection steps. The second
peak is at a finite fraction M, and it seems to be at the same point for all origins. However,
the intensity of each peak strongly differs, depending on the ks value of the origin. For
the higher ks values in the plot, the stronger peak is at the non-zero value, and very few
realizations end up at M = 0 even for smaller degrees. On the contrary, an origin with larger
degree k but smaller ks value results in a stronger peak at M = 0. These distributions
converge quite well, and we can expect that nodes with small ks will in general result in a
higher peak at M = 0. The above means that if an infection can reach a critical mass of
nodes then it will eventually cover a significant part of the network. The low k-shell nodes
cannot reach this critical mass, so the infection dies at the early stages, resulting in the
strong peak at M = 0. On the contrary, the neighborhood of high k-shell nodes is favorable
for sustaining an infection at early stages, allowing the system to reach this critical mass.

FIG. 11: Distribution of spreading based on individual origins. The probability distribution
P(M) of the infected percentage for the contact network of inpatients, when the epidemic starts at
four nodes of different properties. The infection probability is β = 4%, which is above the critical
threshold. All distributions exhibit two peaks at similar ranges every time, i.e. around M = 0
(epidemic dies very fast) and M ≈ 33%. However, the intensity of each peak differs, and in higher
k-shells the majority of the realizations result in large infections, compared to the much higher
ratio of zero-spreading realizations for origins with small ks values.

FIG. 12: k-shell structure of the analyzed networks. (Top row): Visualization of the k-shell
structure. We represent networks as sets of concentric circles of nodes, each one corresponding
to a particular k-shell, with low ks values in the periphery and large ks values towards the
center of the network. The size of each visualized node is proportional to the logarithm of its
degree value. We highlight the 25 highest degree nodes with black squares. Many of the hubs are
found in outer layers. (Bottom row): Scatter plots of node degree k as a function of its k-shell
index ks for the original networks (black symbols) and the degree-preserving randomized version
of the networks (red symbols). The networks correspond to: the cond-mat collaboration network,
the actor network, the email contact network, the RL Internet, the AS Internet, and the Product
space network.



We also highlight the location of the 25 largest hubs in the k-shell structure of the studied
networks. Fig. 12 shows the results for the collaboration, actor, email, RL Internet, AS
Internet, and Product space networks. High-degree nodes in most of the studied networks
are scattered across different k-shells: the high-k nodes appear both in the periphery (starting
as low as ks = 1) and in the network center (large ks value). In certain cases, such as in
the actor network, the largest hubs are located in the highest ks layers. The relation between ks
and k in the AS Internet and the product space is strongly monotonic, and there are very
few nodes where ks is large or small compared to the degree k. This is typical behavior
for random networks, and the structure of these two networks is close to that of their
randomized counterparts. In these cases, choosing a node based on its degree or its k-shell
index does not make a difference, since they practically lead to the same nodes.




FIG. 13: Deviations from the average behavior highlight the importance of the k-shell
structure. The average degree (red symbols) for a given ks index follows roughly a power-law
dependence as a function of ks. The deviation from this behavior can be significant, e.g. in the RL
Internet, or negligible, e.g. in the product space network.

It is clear that the assortative behavior in a network can influence the extent to which
hubs will appear in the periphery or in the core of a network. In principle, in a highly
disassortative network we expect more hubs in the periphery, due to their tendency to
connect to low-degree nodes. However, even in assortative networks it is possible that some
hubs may still belong to low k-shells, so that the ks value will appropriately rank even these
exceptions. The average degree of the nodes in a specific shell follows roughly a power law
with ks (Fig. 13). The deviations from this average behavior emphasize the importance of
spreaders within the core of the network, which have high values of ks and potentially smaller
degrees, over those with high k and low ks values.

The complex organization of the nodes in the k-shells is highlighted when we randomly
rewire the links in the networks while preserving the node degrees. This rewiring 'restores' all
the hubs to the innermost k-shell of the system and imposes a strict hierarchy of nodes in
terms of both k and ks. The bottom row of plots in Fig. 12 shows the scatter plots of degree
k as a function of k-shell index ks for every node in the network. In all cases, a monotonic
relation of k vs ks is followed in the 'rewired' networks (red symbols, where now all the
hubs appear in the highest k-shell), as opposed to the weak correlation between k and ks in
the original networks (shown in black).



VI. REWIRING HIGHLIGHTS THE IMPORTANCE OF k-SHELL

In Figs. 1a and 1b of the main text we show that the extent of infection can be remarkably
different, although we start from two origins with similar degree. The importance of
the structure in the dynamics of spreading can be highlighted if we randomly rewire the
network. During this process the original degrees of all nodes are preserved, but random
neighbors are chosen for each node, thus destroying any correlations and any patterns in
the local connectivity. We denote by P(M|i) the probability that a percentage M of the
total population will be infected if a disease originates on node i. In Figs. 1a,b of the main
text and in Fig. 14 we show that two nodes #1 and #2 with similar degree may yield
markedly different distributions P(M|1) and P(M|2). After rewiring, these distributions
become practically indistinguishable (see Fig. 14).
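One common way to randomize a network while keeping every node's degree fixed is the double edge swap, sketched below. The text does not state which rewiring procedure was used, so this is an assumption of the example, and the names `degree_preserving_rewire` and `degree_sequence` are hypothetical. The sketch expects a simple undirected graph (no self-loops or repeated links) with at least two links, and it rejects any swap that would create a self-loop or a duplicate link.

```python
import random
from collections import Counter


def degree_sequence(edges):
    """Degree of every node in an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)


def degree_preserving_rewire(edges, n_swaps, seed=0, max_tries_factor=100):
    """Randomize links with double edge swaps: (a,b),(c,d) -> (a,d),(c,b)."""
    rng = random.Random(seed)
    edge_list = [tuple(edge) for edge in edges]
    edge_set = {frozenset(edge) for edge in edge_list}
    done = 0
    tries = 0
    while done < n_swaps and tries < max_tries_factor * n_swaps:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # choose one of the two possible pairings at random
        if a == d or c == b:
            continue  # the swap would create a self-loop
        new_1, new_2 = frozenset((a, d)), frozenset((c, b))
        if new_1 in edge_set or new_2 in edge_set:
            continue  # the swap would create a duplicate link
        edge_set.discard(frozenset((a, b)))
        edge_set.discard(frozenset((c, d)))
        edge_set.add(new_1)
        edge_set.add(new_2)
        edge_list[i] = (a, d)
        edge_list[j] = (c, b)
        done += 1
    return edge_list
```

Each accepted swap leaves the degree of a, b, c and d unchanged, so the degree sequence of the rewired network matches the original while the neighborhoods, and hence the ks values, are reshuffled.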



VII. VIRUS PERSISTENCE IN SIS 



FIG. 14: Why the hubs may not be good spreaders. The probability distribution P(M|i)
of the infected percentage for the contact network of inpatients, when the epidemic starts at two
of the origin hubs in Fig. 1, i = A, B, with the same degree (k = 96) but different ks values
(ks = 63 and ks = 26, respectively). In each histogram, we use 1000 random realizations of the
simulation, starting an SIR epidemic from the same given origin i. Despite the fact that the two
origins of the epidemic spreading have the same degree, the two histograms present a radically
different character. In one case (red histogram), the hub infects up to 30% of the population,
while most of the spreading attempts from the other hub (yellow histogram) practically cannot
propagate the infection at all. The importance of the organization of the network is highlighted
when we randomly rewire the network (preserving the same degree for all nodes). In this case both
distributions P(M|A) and P(M|B) coincide and both hubs contribute equally to spreading. Notice
also that spreading in the rewired network extends over a much larger part of the population.

Many infectious diseases, including most sexually transmitted infections, do not confer
immunity after infection, so that they cannot be described via the SIR model. These cases
are better simulated through the SIS epidemic model [18]. The dynamics of SIS epidemics
is different, since the number of infected nodes eventually reaches a dynamic equilibrium
"endemic" state at which exactly as many infectious individuals become susceptible as
susceptible nodes become infected [18]. The quantity characterizing the role of nodes in SIS
spreading is the persistence, ρi(t), defined as the probability that node i is infected at time
t. In an endemic SIS state, which is reached asymptotically, ρi becomes independent of
t. The persistence ρ has been shown to be higher in hubs, which are reinfected frequently
due to the large number of their neighbors [3, Q, 25]. To uncover the role of k-shell layers
in SIS spreading we use the joint persistence function ρ(ks, k).

Here we present results for the virus persistence in the Actor, Collaboration, Email and
RL Internet networks. Similar to Fig. H] we depict ρ(ks, k) in both supercritical (β > βc)
and subcritical (β < βc) regimes, where βc is the critical threshold. In the supercritical
regime, ρ(ks, k) increases with both k and ks, with maximum values corresponding to hubs
in the innermost layers (see Fig. 15). As depicted in Fig. 15, in the subcritical regime, viruses
persist only in the highest ks layers, while the probability of finding an infected node in low
k-shells is negligible.

In order to determine the actual epidemic threshold βc in the above networks, we study
the behavior of SIS spreading over a wide range of β values. In order to highlight the
role of k-shells in spreading, we organize several groups of nodes based on the ks layers of
each network. Every such group comprises approximately 100 randomly chosen nodes with
the corresponding k-shell indices. In order to achieve similar average degree in each of the
groups, we pick nodes with uniform probability based on their degree. As shown in Fig. 16,
virus persistence is consistently higher in the inner k-shells for all values of β. Moreover,
we find substantially lower epidemic thresholds than in the random cases, βc < βc^rand, in all
considered networks except for the Email Contact network.

The results of Figs. 15 and 16 suggest that the observed persistence of a virus is due to
the dense sub-network formed by nodes in the innermost k-shell, which helps the virus to
consistently survive locally in this area. Indeed, the innermost layers can be regarded as
a small subgraph consisting exclusively of hubs. By definition, all nodes in this innermost
k-shell will have degrees k ≥ ks,max. Therefore, as a simple approximation, one can regard
the innermost core of a network as a regular graph consisting of nodes with the same degree
k = ks,max.

The mean-field solution of the SIS spreading in a regular graph can be found, for instance,
in Ref. [24]. We reproduce this solution below for the sake of convenience.

The master equation describing the time evolution, at a mean-field level, of the average
density of infected individuals ρ(t) is

$$\frac{d\rho(t)}{dt} = -\rho(t) + \beta k \rho(t)\,(1 - \rho(t)), \tag{7}$$

where k is the degree of all nodes in the regular graph. The first term on the right hand
side of Eq. (7) accounts for infected nodes becoming healthy. The second term on the right
hand side of Eq. (7) accounts for healthy nodes becoming infected: a randomly chosen node
is healthy with probability 1 - ρ(t), and this healthy node can be infected by any of its k
neighbor nodes with total probability βkρ(t). The stationary endemic state is reached
when dρ(t)/dt = 0, which leads to

$$\rho = 1 - \frac{1}{\beta k}, \tag{8}$$

indicating the existence of a nonzero epidemic threshold βc = 1/k. The innermost core of
a network consisting only of nodes with degrees k ≥ ks,max will have epidemic threshold

$$\beta_c \le \frac{1}{k_{s,\max}}. \tag{9}$$

The above inequality holds for all considered networks. Moreover, this inequality becomes
an equality for the CNI and collaboration networks, where nearly all nodes in the innermost
cores have degree k ≈ ks,max.
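The mean-field picture above can be checked numerically. The sketch below integrates Eq. (7) with a simple forward Euler scheme and compares the long-time density with the stationary value 1 - 1/(βk) above the threshold βc = 1/k. The function name, the step size, the initial density and the example values β = 0.05, β = 0.02 and k = 40 are illustrative assumptions, not parameters taken from the networks studied here.

```python
def sis_mean_field_density(beta, k, rho0=0.01, dt=0.01, steps=20_000):
    """Integrate d(rho)/dt = -rho + beta*k*rho*(1 - rho) with forward Euler."""
    rho = rho0
    for _ in range(steps):
        recovery = -rho                          # infected nodes becoming healthy
        infection = beta * k * rho * (1.0 - rho)  # healthy nodes becoming infected
        rho += dt * (recovery + infection)
    return rho


k_core = 40                       # hypothetical degree of the regular graph
beta_c = 1.0 / k_core             # mean-field epidemic threshold

# Above the threshold (beta * k = 2) the density settles at 1 - 1/(beta*k).
rho_endemic = sis_mean_field_density(beta=0.05, k=k_core)
rho_predicted = 1.0 - 1.0 / (0.05 * k_core)

# Below the threshold (beta * k = 0.8) the infection dies out.
rho_subcritical = sis_mean_field_density(beta=0.02, k=k_core)
```

With these values the integrated density approaches 0.5 in the supercritical case and decays towards zero in the subcritical one, in line with the threshold βc = 1/k.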
FIG. 15: SIS maps

FIG. 16: How average SIS persistence in different k-shells depends on virus contagiousness.
For every network we randomly sample several groups of nodes based on k-shell index
(as described in SI). We plot the average virus persistence ρ for every group of nodes as a function
of β for the Email, Actor, Collaboration and RL Internet networks. Virus persistence is higher for
nodes located in higher k-shells.